Chapter 1: Intro

ex 1.1. List five differences between raster and vector data.

  • raster data have values for pixels, vector data for points, lines or polygons
  • spatial locations of raster pixels are constrained to a grid, vector data coordinates can have arbitrary locations (only limited by floating point representation of coordinates)
  • raster data lend themselves well to represent spatially continuously observed variables (such as imagery) or spatially continuously varying variables (such as elevation or temperature); vector data lend themselves well to represent spatially discrete features such as houses and roads, or administrative regions.
  • raster data cover their spatial extent completely: every point is part of a single pixel; vector data may contain holes, or have intersecting geometries where points belong to multiple polygons.
  • the operations on raster data are often simple mathematical (raster algebra) operations that include spatial operations; such simple operations are usually not available for vector data.
  • raster data have trivial topology: it is clear which 4 or 8 pixels are the neighbours of every pixel; for vector data there are more types of spatial relationships, and these relationships are more complicated to detect.

The answer “Raster data is continuous data while vector data is discrete data.” is not complete: a raster of land use type represents a discrete (type) variable, a polygon map with population density represents a continuous variable. The difference lies in spatially continuous variables like elevation or temperature, which are more easily represented by raster data, and spatially discrete features, such as houses and roads, which are more easily represented by vector data.

ex 1.2. In addition to those listed below figure 1.1, list five further graphical components that are often found on a map.

  • scale bar
  • data source
  • well defined title, subtitle
  • orientation indicator, north arrow
  • further reference elements: seas, land mass, rivers

ex 1.3. Why is the numeric information shown in figure 1.4 misleading (or meaningless)?

The values shown in figure 1.4 are population totals associated with their respective counties. Without the county boundaries the meaning disappears: raster pixels do not contain population totals per pixel, and population totals over larger regions or population densities can no longer be derived based on this raster map alone.

ex 1.4. Under which conditions would you expect strong differences when doing geometrical operations on \(S^2\), compared to doing them on \(R^2\)?

  • when computing distances between two points at large distance from each other
  • when determining the shortest line between two points, in particular near the poles, or when this line crosses the antimeridian
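A quick numerical sketch in plain R (the point pair is hypothetical; the spherical law of cosines used here also appears in the chapter 2 exercises below, and 111 km per degree is a rough approximation):

```r
deg2rad = function(d) d * pi / 180
gc_dist = function(lon1, lat1, lon2, lat2, R = 6371) { # R: Earth radius in km
  R * acos(sin(deg2rad(lat1)) * sin(deg2rad(lat2)) +
           cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * cos(deg2rad(lon1 - lon2)))
}
# two points one degree of longitude apart at latitude 80:
planar = sqrt((11 - 10)^2 + (80 - 80)^2) * 111  # naive R^2: degrees as Cartesian
sphere = gc_dist(10, 80, 11, 80)                # great circle distance on S^2
c(planar = planar, sphere = sphere)             # planar is almost 6 times too large
```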

Chapter 2: Coordinates

ex 2.1. List three geographic measures that do not have a natural zero origin

  • longitude: the zero meridian is arbitrary; a century ago several other zero meridians were still in fashion
  • latitude: the equator may feel like a natural zero, but one could equally use the North Pole as zero, or choose entirely different origins and orientation for longitude and latitude.
  • altitude (measured w.r.t. mean sea level, geoid, or ellipsoid)

ex 2.2 - 2.4

(thanks to Jonas Hurst)

Convert the (x, y) points (10, 2), (-10, -1), (10, -2) and (0, 10) to polar coordinates

cart2polar = function(x, y){
  r = sqrt(x*x + y*y)  # compute r (distance from origin)
  phi = atan2(y, x)  # compute phi (angle between point and positive x axis in rad)
  phi_deg = phi * 180 / pi  #  compute angle in deg
  result = c(r, phi_deg)
  return(result)
}

cart2polar(10, 2)
## [1] 10.19804 11.30993
cart2polar(-10, -1)
## [1]   10.04988 -174.28941
cart2polar(10, -2)
## [1]  10.19804 -11.30993
cart2polar(0, 10)
## [1] 10 90

Convert from Polar to Cartesian

Convert the polar (r, phi) points (10, 45°), (0, 100°) and (5, 259°) to Cartesian coordinates

deg2rad = function(angle_degree) {
  angle_degree * pi / 180
}

polar2cart = function(r, phi_deg){
  # phi must be in degrees
  phi_rad = deg2rad(phi_deg)  # convert phi in degrees to radians
  x = r * cos(phi_rad)
  y = r * sin(phi_rad)
  c(x, y) # return value
}

polar2cart(10, 45)
## [1] 7.071068 7.071068
polar2cart(0, 100)
## [1] 0 0
polar2cart(5, 259)
## [1] -0.954045 -4.908136

Assuming the Earth is a sphere with a radius of 6371 km, compute for (lambda, phi) points the great circle distance between (10, 10) and (11, 10), between (10, 80) and (11, 80), between (10, 10) and (10, 11) and between (10, 80) and (10, 81).

distOnSphere = function(l1, phi1, l2, phi2, radius) {
  l1_rad = deg2rad(l1)
  l2_rad = deg2rad(l2)
  phi1_rad = deg2rad(phi1)
  phi2_rad = deg2rad(phi2)

  theta = acos(
    sin(phi1_rad) * sin(phi2_rad) +
    cos(phi1_rad) * cos(phi2_rad) * cos(abs(l1_rad - l2_rad))
  )
  radius * theta # return value
}

radius = 6371
distOnSphere(10, 10, 11, 10, radius)
## [1] 109.5056
distOnSphere(10, 80, 11, 80, radius)
## [1] 19.30855
distOnSphere(10, 10, 10, 11, radius)
## [1] 111.1949
distOnSphere(10, 80, 10, 81, radius)
## [1] 111.1949

The unit of all results is kilometres.

Chapter 3: Geometries

(thanks to Jannis Fröhlking)

ex 3.1 Give two examples of geometries in 2-D (flat) space that are not simple feature geometries, and create a plot of them.

library(sf)
## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
x1 <- st_linestring(rbind(c(0,0),c(2,2),c(0,2),c(2,0)))
x2 <- st_polygon(list(rbind(c(3,0),c(5,2),c(3,2),c(5,0),c(3,0))))
plot(c(x1,x2), col = 2:3)

st_is_simple(x1)
## [1] FALSE
st_is_simple(x2)
## [1] FALSE

ex 3.2 Recompute the coordinates 10.542, 0.01, 45321.789 using precision values 1, 1e3, 1e6, and 1e-2.

for(i in c(1,1e3,1e6,1e-2)) 
  print(round(i * c(10.542, 0.01, 45321.789))/i)
## [1]    11     0 45322
## [1]    10.542     0.010 45321.789
## [1]    10.542     0.010 45321.789
## [1]     0     0 45300

ex 3.3 Describe a practical problem for which an n-ary intersection would be needed.

  • for a long-term set of polygons with fire extents, find the polygons that underwent 0, 1, 2, 3, … fires
  • for a set of extents of n individual plant species, find polygons with 0, 1, …, n species, or find the polygon(s) that contain a particular subset of plant species.
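A minimal sketch of the fire example, with three hypothetical overlapping square “fire extents”; for sf input, st_intersection() called with a single argument computes the n-ary intersection and adds an n.overlaps column giving, per resulting piece, how many input polygons cover it:

```r
library(sf)
sq = function(x0, y0) # 2 x 2 square with lower-left corner (x0, y0)
  st_polygon(list(rbind(c(x0, y0), c(x0 + 2, y0), c(x0 + 2, y0 + 2),
                        c(x0, y0 + 2), c(x0, y0))))
fires = st_sf(geometry = st_sfc(sq(0, 0), sq(1, 0), sq(1, 1)))
i = st_intersection(fires)    # n-ary (self-) intersection
table(i$n.overlaps)           # pieces burnt once, twice, three times
```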

ex 3.4 How can you create a Voronoi diagram (figure 3.3) that has closed polygons for every point?

Voronoi diagrams have “open polygons”, areas that extend into infinity, for boundary points. These cannot be represented by simple feature geometries. st_voronoi chooses a default (square) polygon to limit the extent, which can be enlarged. Alternatively, the extent can be limited using st_intersection on its result:

library(sf)
par(mfrow = c(2,2))
set.seed(133331)
mp = st_multipoint(matrix(runif(20), 10))
plot(st_voronoi(mp), col = NA, border = 'black')
plot(mp, add = TRUE)
title("default extent")
e2 = st_polygon(list(rbind(c(-5,-5), c(5, -5), c(5,5), c(-5, 5), c(-5,-5))))
plot(st_voronoi(mp, envelope = e2), col = NA, border = 'black')
plot(mp, add = TRUE)
title("enlarged envelope")
e3 = st_polygon(list(rbind(c(0,0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
v = st_voronoi(mp) %>% st_collection_extract() # pulls POLYGONs out of GC
plot(st_intersection(v, e3), col = NA, border = 'black', axes=TRUE)
plot(mp, add = TRUE)
title("smaller, intersected envelope")

ex 3.5 Give the unary measure dimension for geometries POINT Z (0 1 1), LINESTRING Z (0 0 1,1 1 2), and POLYGON Z ((0 0 0,1 0 0,1 1 0,0 0 0))

st_dimension(st_point(c(0,1,1)))
## [1] 0
st_dimension(st_linestring(rbind(c(0,1,1),c(1,1,2))))
## [1] 1
st_dimension(st_polygon(list(rbind(c(0,0,0),c(1,0,0),c(1,1,0),c(0,0,0)))))
## [1] 2

(the dimensions are 0, 1 and 2 for the point, linestring and polygon respectively, irrespective of the number of coordinate dimensions (here three: XYZ) they’re defined in)

ex 3.6 Give the DE-9IM relation between LINESTRING(0 0,1 0) and LINESTRING(0.5 0,0.5 1); explain the individual characters.

line_1 = st_linestring(rbind(c(0,0),c(1,0)))
line_2 = st_linestring(rbind(c(.5,0),c(.5,1)))
plot(line_1,col = "green")
plot(line_2,col = "red", add = TRUE)

st_relate(line_1, line_2)
##      [,1]       
## [1,] "F01FF0102"

The DE-9IM relation is F01FF0102

  • F Intersection of the green line’s interior and the red line’s interior is empty
  • 0 Intersection of the green line’s interior and the red line’s boundary is a single point in the middle of the green line
  • 1 Intersection of the green line’s interior and the red line’s exterior is a line covering most of the green line
  • F Intersection of the green line’s boundary and the red line’s interior is empty
  • F Intersection of the green line’s boundary and the red line’s boundary is empty
  • 0 Intersection of the green line’s boundary and the red line’s exterior consists of the two boundary points of the green line
  • 1 Intersection of the green line’s exterior and the red line’s interior is a line covering most of the red line
  • 0 Intersection of the green line’s exterior and the red line’s boundary is the upper boundary point of the red line
  • 2 Intersection of the green line’s exterior and the red line’s exterior is a polygonal geometry covering everything except the two lines

(the boundary of a LINESTRING is formed by its two end points)

ex 3.7 Can a set of simple feature polygons form a coverage? If so, under which constraints?

Yes, but I would say only if the set contains just one polygon, because simple features provide no way of assigning points on the shared boundary of two adjacent polygons to a single polygon.

ex 3.8 For the nc counties in the dataset that comes with R package sf, find the points touched by four counties.

# read data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
## Reading layer `nc' from data source 
##   `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/shape/nc.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
# get intersections
(nc_geom = st_geometry(nc))
## Geometry set for 100 features 
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
## First 5 geometries:
## MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
## MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3...
## MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3...
## MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36...
## MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3...
nc_ints = st_intersection(nc_geom)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
plot(nc_ints, main = "All intersections")

# Function to check class of intersection objects
get_points = function(x){
  if(class(x)[2]=="POINT")  return(x)
}
# get points
points = lapply(nc_ints, get_points)
points[sapply(points,is.null)] <- NULL
sf_points = st_sfc(points)
st_crs(sf_points) = st_crs(nc)
# get points with four neighbouring geometries (= counties)
touch = st_touches(sf_points, nc_geom)
four_n = sapply(touch, function(y) which(length(y)==4))
names(four_n) = seq_along(four_n)
point_no = array(as.numeric(names(unlist(four_n))))
result = st_sfc(points[point_no])
plot(nc_geom, main = "Points touched by four counties")
plot(result, add = TRUE, col = "red", pch = 10, cex = 2)

A more compact way might be to search for points where counties touch another county only in a point, which can be found using st_relate using a pattern:

(pts = nc %>% st_relate(pattern = "****0****"))
## although coordinates are longitude/latitude, st_relate_pattern assumes that they are planar
## Sparse geometry binary predicate list of length 100, where the
## predicate was `relate_pattern'
## first 10 elements:
##  1: (empty)
##  2: (empty)
##  3: (empty)
##  4: (empty)
##  5: (empty)
##  6: (empty)
##  7: (empty)
##  8: (empty)
##  9: 31
##  10: 26
nc %>% st_relate(pattern = "****0****") %>% lengths() %>% sum()
## although coordinates are longitude/latitude, st_relate_pattern assumes that they are planar
## [1] 28

which is, as expected, four times the number of points shown in the plot above.

How can we find these points? See here:

nc = st_geometry(nc)
s2 = sf_use_s2(FALSE) # use GEOS geometry
## Spherical geometry (s2) switched off
pts = st_intersection(nc, nc)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
pts = pts[st_dimension(pts) == 0]
plot(st_geometry(nc))
plot(st_geometry(pts), add = TRUE, col = "red", pch = 10, cex = 2)

sf_use_s2(s2) # set back
## Spherical geometry (s2) switched on

ex 3.9 How would figure 3.6 look if delta for the y-coordinate was positive?

Only cells that were fully crossed by the red line would be grey:

library(stars)
## Loading required package: abind
ls = st_sf(a = 2, st_sfc(st_linestring(rbind(c(0.1, 0), c(1, .9)))))
grd = st_as_stars(st_bbox(ls), nx = 10, ny = 10, xlim = c(0, 1.0), ylim = c(0, 1),
   values = -1)
attr(grd, "dimensions")$y$delta = .1
attr(grd, "dimensions")$y$offset = 0 
r = st_rasterize(ls, grd, options = "ALL_TOUCHED=TRUE")
r[r == -1] = NA
plot(st_geometry(st_as_sf(grd)), border = 'orange', col = NA, 
     reset = FALSE, key.pos=NULL)
plot(r, axes = TRUE, add = TRUE, breaks = "equal")
plot(ls, add = TRUE, col = "red", lwd = 2)

The reason is that in this case, lower left corners of grid cells are part of the cell, rather than upper left corners.

Chapter 4: Spherical geometry

ex 4.1 Straight GeoJSON lines

How does the GeoJSON format define “straight” lines between ellipsoidal coordinates (section 3.1.1)? Using this definition of straight, what would LINESTRING(0 85,180 85) look like in a polar projection? How could this geometry be modified to have it cross the North Pole?

GeoJSON defines straight lines between pairs of ellipsoidal coordinates as the straight line in Cartesian space formed by longitude and latitude. This means e.g. that all parallels are straight lines.

Using this definition of straight, what would LINESTRING(0 85,180 85) look like in a polar projection?

Like a half circle:

library(sf)
l <- st_as_sfc("LINESTRING(0 85,180 85)") %>%
    st_segmentize(1) %>%
    st_set_crs('EPSG:4326')
plot(st_transform(l, 'EPSG:3995'), col = 'red', lwd = 2,
     graticule = TRUE, axes = TRUE, reset = FALSE)

How could this geometry be modified to have it cross the North Pole?

One would have to let it pass through (0 90) and (180 90):

library(sf)
l <- st_as_sfc("LINESTRING(0 85,0 90,180 90,180 85)") %>%
    st_segmentize(1) %>%
    st_set_crs('EPSG:4326')
plot(st_transform(l, 'EPSG:3995'), col = 'red', lwd = 2,
     graticule = TRUE, axes = TRUE, reset = FALSE)

ex 4.2 For a typical polygon on \(S^2\), how can you find out ring direction?

Ring direction (clockwise, CW, or counter-clockwise, CCW) is unambiguous on \(R^2\) but not on \(S^2\): on \(S^2\), every polygon divides the sphere’s surface into two parts. When the inside of the polygon is taken as the area to the left when traversing the polygon’s points, then ring direction is CCW if the area of the polygon is smaller than half the area of the sphere. For polygons dividing the sphere into two equal parts (great circles such as the equator or meridians) ring direction is ambiguous.
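A sketch of how this can be checked with the s2 package directly: with oriented = TRUE, the area to the left of the traversal direction is taken as the interior, and comparing the polygon’s area with half the sphere’s area reveals the ring direction:

```r
library(s2)
ccw = "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))" # interior: the small square
cw  = "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))" # same ring, traversed the other way
globe = s2_area(as_s2_geography(TRUE))      # area of the full sphere
s2_area(s2_geog_from_text(ccw, oriented = TRUE)) / globe # much smaller than 0.5
s2_area(s2_geog_from_text(cw,  oriented = TRUE)) / globe # larger than 0.5
```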

ex 4.3 Are there advantages of using bounding caps over using bounding boxes? If so, list them.

Bounding caps may be more compact (have a smaller area compared to the bounding box corresponding to the same geometries), they need fewer parameters, and they are invariant under rotation of (the origins of) longitude and latitude.

For areas covering one of the poles, a bounding box will always need a longitude range spanning from -180 to 180, irrespective of whether the geometry is centered around the pole.
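A sketch illustrating this with the s2 package (the polygon is hypothetical; s2_bounds_cap() returns a bounding cap, s2_bounds_rect() a latitude/longitude bounding rectangle):

```r
library(s2)
pole = s2_make_polygon(c(0, 120, -120), c(85, 85, 85)) # small area around the North Pole
s2_bounds_cap(pole)  # centre (lng, lat) plus opening angle: stays compact
s2_bounds_rect(pole) # lat/lng rectangle: lng spans the full -180..180 range
```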

ex 4.4 Why is, for small areas, the orthographic projection centered at the area a good approximation of the geometry as handled on \(S^2\)?

Because an orthographic projection centered at the area views the sphere from directly above that area, so that for a small area the distortion of distances, angles, and areas is minimal: it is the closest approximation on \(R^2\) of the geometry on \(S^2\).
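A small sketch (using a hypothetical 0.1-degree square): the area computed on \(S^2\) and the area computed after orthographic projection centered on the square agree closely:

```r
library(sf)
p = st_sfc(st_polygon(list(rbind(c(7, 52), c(7.1, 52), c(7.1, 52.1),
                                 c(7, 52.1), c(7, 52)))), crs = 'EPSG:4326')
a_s2 = st_area(p)                              # area as handled on S^2
ortho = "+proj=ortho +lon_0=7.05 +lat_0=52.05" # centered on the square
a_ortho = st_area(st_transform(p, ortho))      # planar area after projection
as.numeric(a_ortho / a_s2)                     # close to 1
```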

ex 4.5 Fiji

For rnaturalearth::ne_countries(country = "Fiji", returnclass="sf"), check whether the geometry is valid on \(R^2\), on an orthographic projection centered on the country, and on \(S^2\). How can the geometry be made valid on \(S^2\)? Plot the resulting geometry back on \(R^2\). Compare the centroid of the country, as computed on \(R^2\) and on \(S^2\), and the distance between the two.

Valid on \(R^2\):

fi = rnaturalearth::ne_countries(country = "Fiji", returnclass="sf") %>%
        st_geometry()
s2 = sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
st_is_valid(fi)
## [1] TRUE

Valid on orthographic projection:

ortho = "+proj=ortho +lon_0=178.6 +lat_0=-17.3"
st_transform(fi, ortho) %>% st_is_valid()
## [1] FALSE
plot(st_transform(fi, ortho), border = 'red')

The red line following the antimeridian makes the geometry invalid in this projection, and also on \(S^2\):

sf_use_s2(TRUE)
## Spherical geometry (s2) switched on
st_is_valid(fi)
## [1] FALSE

Make valid on \(S^2\), and plot:

fi.s2 = st_make_valid(fi)
st_is_valid(fi.s2)
## [1] TRUE
plot(st_transform(fi.s2, ortho), border = 'red')
title("valid")

where we see that the line at the antimeridian has disappeared. This makes plotting in \(R^2\) look terrible, with lines spanning the globe:

plot(fi.s2, axes = TRUE)

Compare the centroid of the country, as computed on \(R^2\) and on \(S^2\), and the distance between the two.

sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
(c1 = st_centroid(fi))
## Warning in st_centroid.sfc(fi): st_centroid does not give correct centroids for
## longitude/latitude data
## Geometry set for 1 feature 
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 163.8532 ymin: -17.31631 xmax: 163.8532 ymax: -17.31631
## CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## POINT (163.8532 -17.31631)
sf_use_s2(TRUE)
## Spherical geometry (s2) switched on
(c2 = st_centroid(fi.s2))
## Geometry set for 1 feature 
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 178.5684 ymin: -17.31562 xmax: 178.5684 ymax: -17.31562
## CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## POINT (178.5684 -17.31562)
st_distance(c1, c2)
## Units: [m]
##         [,1]
## [1,] 1561723
sf_use_s2(s2)

Chapter 5: Attributes

ex 5.1. type of State

The appropriate value would be constant: there is no identity relationship of State to one of the counties in nc, and the value of State is constant throughout each county in the state (every point in every county in the state has this value for State).
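In sf, this can be recorded in the attribute-geometry relationship (agr) metadata; a sketch (the State variable is added here for illustration):

```r
library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
# a one-attribute sf object carrying the (constant) State variable:
nc_state = st_sf(State = "North Carolina", geometry = st_geometry(nc))
st_agr(nc_state) = c(State = "constant")
st_agr(nc_state)
```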

ex 5.2. type of State for the entire state

Now, the unioned geometry is that of the state, and we can assign identity: there is only one state of North Carolina, and this geometry is its geometry.

ex 5.3. the AREA variable

The nc dataset is rather old, and did not come with an extensive report on how, in detail, certain variables such as AREA were derived, so some detective work is needed here. How did people do this, more than three decades ago?

We can now compute area by

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
nc$AREA[1:10]
##  [1] 0.114 0.061 0.143 0.070 0.153 0.097 0.062 0.091 0.118 0.124
s2 = sf_use_s2(FALSE) # use ellipsoidal geometry:
## Spherical geometry (s2) switched off
nc$area = a_ell = st_area(nc)
nc$area[1:10]
## Units: [m^2]
##  [1] 1137388604  611077263 1423489919  694546292 1520740530  967727952
##  [7]  615942210  903650119 1179347051 1232769242
sf_use_s2(TRUE) # use spherical geometry:
## Spherical geometry (s2) switched on
nc$area = a_sph = st_area(nc)
nc$area[1:10]
## Units: [m^2]
##  [1] 1137107793  610916077 1423145355  694378925 1520366979  967504822
##  [7]  615794941  903423919 1179065710 1232475139
sf_use_s2(s2) # set back to original
cor(a_ell, a_sph)
## [1] 0.9999999

and this gives the area, in square metres, computed using either ellipsoidal or spherical geometry. We see that these are not identical, but nearly perfectly linearly correlated.

A first hypothesis might be a constant factor between the area and AREA variables. For this, we could try a power of 10:

nc$area2 = units::drop_units(nc$area / 1e10)
cor(nc$AREA, nc$area2)
## [1] 0.9998116
summary(lm(area2 ~ AREA, nc))
## 
## Call:
## lm(formula = area2 ~ AREA, data = nc)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.281e-03 -6.279e-04  6.328e-05  5.495e-04  2.746e-03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0009781  0.0002692  -3.633 0.000448 ***
## AREA         1.0138124  0.0019882 509.920  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0009733 on 98 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 2.6e+05 on 1 and 98 DF,  p-value: < 2.2e-16
plot(area2 ~ AREA, nc)
abline(0, 1)

and we see a pretty good, close to 1:1 correspondence! But the factor 1e10 is strange: it does not convert square metres into a usual unit for area, neither for metric nor for imperial units.

Also, there are deviations from the 1:1 regression line. Could these be explained by the rounding of AREA to three digits? If rounding to three digits was the only cause of spread around the regression line, we would expect a residual standard error similar to the standard deviation of a uniform distribution with width .001, which is

sqrt(0.001^2/12)
## [1] 0.0002886751

but the one obtained in the regression is three times larger. Also, the unit of AREA would be 1e10 \(m^2\), or 1e4 \(km^2\), which is odd and could ring some bells: one degree of latitude corresponds roughly to 111 km, so one “square degree” at the equator corresponds roughly to \(1.11^2 \times 10^4\) \(km^2\), and at 35 degrees North roughly to

111 ^ 2 * cos(35 / 180 * pi)
## [1] 10092.77

which closely corresponds to the regression slope found above.

We can compute “square degree” area by using the \(R^2\) area routines, e.g. obtained when we set the CRS to NA:

nc2 = nc
st_crs(nc2) = NA
nc2$area = st_area(nc2) # "square degrees"
plot(area ~ AREA, nc2)
abline(0,1)

cor(nc2$area, nc2$AREA)
## [1] 0.999983
summary(lm(area ~ AREA, nc2))
## 
## Call:
## lm(formula = area ~ AREA, data = nc2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.471e-04 -2.265e-04 -9.880e-06  2.714e-04  4.594e-04 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 7.436e-05  7.965e-05    0.934    0.353    
## AREA        9.996e-01  5.882e-04 1699.395   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002879 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.888e+06 on 1 and 98 DF,  p-value: < 2.2e-16

We now get a much better fit, a near perfect correlation, and a regression standard error that corresponds exactly to what one would expect after rounding AREA to three digits.

A further “red flag” against the constant (1e10) conversion hypothesis is the spatial pattern of the regression residuals obtained by the first approach:

nc$resid = residuals(lm(area2 ~ AREA, nc))
plot(nc["resid"])

these residuals clearly show a North-South trend, corresponding to the effect that the Earth’s curvature has been ignored during the computation of AREA (ellipsoidal coordinates were treated as if they were Cartesian). “Square degrees” become smaller when going north.

The “unit” of the AREA variable is hence “square degree”. This is a meaningless unit for area on the sphere, because a unit square degree does not have a constant area.

ex 5.4 type of area

“area” is of type aggregate: it is a property of a polygon as a whole, not of each individual point in the polygon. It is extensive: if we cut a polygon in two parts, the total area is distributed over the parts.
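A minimal planar sketch (unit-less coordinates) of this extensive behaviour: cutting a rectangle in two and summing the parts’ areas recovers the total:

```r
library(sf)
p    = st_sfc(st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 1), c(0, 1), c(0, 0)))))
half = st_sfc(st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))))
# area of the left part plus area of the remainder equals st_area(p):
st_area(st_intersection(p, half)) + st_area(st_difference(p, half)) # 2
```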

ex 5.5 area-weighted interpolation

From the on-line version of the book we get the code that created the plot:

g = st_make_grid(st_bbox(st_as_sfc("LINESTRING(0 0,1 1)")), n = c(2,2))
par(mar = rep(0,4))
plot(g)
plot(g[1] * diag(c(3/4, 1)) + c(0.25, 0.125), add = TRUE, lty = 2)
text(c(.2, .8, .2, .8), c(.2, .2, .8, .8), c(1,2,4,8), col = 'red')

A question is how we can make g into an sf object with the right attribute values associated with the right geometries. We try values 1:4:

sf = st_sf(x = 1:4, geom = g)
plot(sf)

and see the order of the geometries: row-wise, bottom row first, so

sf = st_sf(x = c(1,2,4,8), geom = g)
plot(sf)

gives us the source object. We create target geometries by

dashed = g[1] * diag(c(3/4, 1)) + c(0.25, 0.125)
box = st_union(g)
c(dashed, box)
## Geometry set for 2 features 
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
## POLYGON ((0.25 0.125, 0.625 0.125, 0.625 0.625,...
## POLYGON ((1 0, 0.5 0, 0 0, 0 0.5, 0 1, 0.5 1, 1...

and can call st_interpolate_aw to compute the area-weighted interpolations:

st_interpolate_aw(sf, c(dashed, box), extensive = TRUE)
## Warning in st_interpolate_aw.sf(sf, c(dashed, box), extensive = TRUE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
##       x                       geometry
## 1  1.75 POLYGON ((0.25 0.125, 0.625...
## 2 15.00 POLYGON ((1 0, 0.5 0, 0 0, ...
st_interpolate_aw(sf, c(dashed, box), extensive = FALSE)
## Warning in st_interpolate_aw.sf(sf, c(dashed, box), extensive = FALSE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
##          x                       geometry
## 1 2.333333 POLYGON ((0.25 0.125, 0.625...
## 2 3.750000 POLYGON ((1 0, 0.5 0, 0 0, ...

This generates a warning, which we can get rid of by setting the agr to constant:

st_agr(sf) = "constant"
st_interpolate_aw(sf, c(dashed, box), FALSE)
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
##          x                       geometry
## 1 2.333333 POLYGON ((0.25 0.125, 0.625...
## 2 3.750000 POLYGON ((1 0, 0.5 0, 0 0, ...

Chapter 6: Data Cubes

ex 6.1.

Why is it difficult to represent trajectories, sequences of \((x,y,t)\) obtained by tracking moving objects, by data cubes as described in this chapter?

  • rounding \((x,y,t)\) to the discrete set of dimension values in a data cube may cause loss of information
  • if the dimensions all have a high resolution, data loss is limited but the data cube will be very sparse; this will only be effective if a system capable of storing sparse data cubes is used (e.g. SciDB, TileDB)

ex 6.2.

In a socio-economic vector data cube with variables population, life expectancy, and gross domestic product ordered by dimensions country and year, which variables have block support for the spatial dimension, and which have block support for the temporal dimension?

  • population has spatial block support (total over an area), typically not temporal block support (but the population e.g. on a particular day of the year)
  • life expectancy is calculated over the total population of the country and as such has spatial block support; it also has temporal block support, as the number of deaths over a particular period is counted (it is not clear whether this always corresponds to a single year or to a longer period)
  • GDP has both spatial and temporal block support: it is a total over an area and a time period.

ex 6.3.

The Sentinel-2 satellites collect images in 12 spectral bands; list advantages and disadvantages to represent them as (i) different data cubes, (ii) a data cube with 12 attributes, one for each band, and (iii) a single attribute data cube with a spectral dimension.

  • as (i): it would be easy to cope with the differences in cell sizes;
  • as (ii): one would have to cope with differences in cell sizes (10, 20, 60m), and it would not be easy to consider the spectral reflectance curve of individual pixels
  • as (iii): as (ii) but it would be easier to consider (analyse, classify, reduce) spectral reflectance curves, as they are now organized in a dimension

ex 6.4.

Explain why a curvilinear raster as shown in figure 1.5 can be considered a special case of a data cube.

  • Curvilinear grids do not have a simple relationship between dimension indexes (row/col, i/j) and coordinate values (lon/lat, x/y): one needs both row and column to find the corresponding coordinate pair, and going from a coordinate pair to the corresponding row and column requires a rather complex look-up.

ex 6.5.

Explain how the following problems can be solved with data cube operations filter, apply, reduce and/or aggregate, and in which order. Also mention for each which function is applied, and what the dimensionality of the resulting data cube is (if any):

ex 6.5.1

from hourly \(PM_{10}\) measurements for a set of air quality monitoring stations, compute per station the amount of days per year that the average daily \(PM_{10}\) value exceeds 50 \(\mu g/m^3\)

  • convert measured hourly values into daily averages: aggregate (from hourly to daily, function: mean)
  • convert daily averages into TRUE/FALSE whether the daily average exceeds 50: apply (function: larger-than)
  • compute the number of days: reduce time (function: sum)

This gives a one-dimensional data cube, with dimension “station”.
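A toy sketch of this chain of operations with plain R arrays (hypothetical data: 2 stations, 3 days of hourly values):

```r
set.seed(1)
pm10 = matrix(runif(2 * 72, 20, 80), nrow = 2)   # station x hour, hypothetical PM10
day = rep(1:3, each = 24)
daily = t(apply(pm10, 1, function(h) tapply(h, day, mean))) # aggregate: hourly -> daily mean
exceed = daily > 50                                         # apply: threshold at 50
rowSums(exceed) # reduce over time: number of exceedance days per station
```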

ex 6.5.2

for a sequence of aerial images of an oil spill, find the time at which the oil spill had its largest extent, and the corresponding extent

  • for each image, classify pixels into oil/no oil: apply (function: classify)
  • for each image, compute size (extent) of oil spill: reduce space (function: sum)
  • for the extent time series, find time of maximum: reduce time (function: which.max, then look up time)

This gives a zero-dimensional data cube (a scalar).

ex 6.5.3

from a 10-year period with global daily sea surface temperature (SST) raster maps, find the area with the 10% largest and 10% smallest temporal trends in SST values.

  • from daily SST to trend values per pixel: reduce time (function: trend function, lm)
  • from trend raster, find 10- and 90-percentile: reduce space (function: quantile)
  • using percentiles, threshold the trend raster: apply (function: less than / more than)

This gives a two-dimensional data cube (or raster layer: the reclassified trend raster).

Chapter 7: sf, stars

ex. 7.1

Find the names of the nc counties that intersect LINESTRING(-84 35,-78 35); use [ for this, and also use st_join().

(file = system.file("gpkg/nc.gpkg", package="sf"))
## [1] "/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg"
nc = st_read(file)
## Reading layer `nc.gpkg' from data source 
##   `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
line = st_as_sfc("LINESTRING(-84 35,-78 35)", crs = st_crs(nc))
nc[line,]$NAME
##  [1] "Jackson"     "Mecklenburg" "Macon"       "Sampson"     "Cherokee"   
##  [6] "Cumberland"  "Union"       "Anson"       "Hoke"        "Duplin"     
## [11] "Richmond"    "Clay"        "Scotland"
st_join(st_sf(line), nc)$NAME # left join: `line` should be first argument
##  [1] "Jackson"     "Mecklenburg" "Macon"       "Sampson"     "Cherokee"   
##  [6] "Cumberland"  "Union"       "Anson"       "Hoke"        "Duplin"     
## [11] "Richmond"    "Clay"        "Scotland"

ex. 7.2

Repeat this after setting sf_use_s2(FALSE), and compute the difference (hint: use setdiff()), and color the counties of the difference using color ‘#00880088’.

# save names first:
sf_use_s2(TRUE)
names_with_s2 = nc[line,]$NAME
sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
nc[line,]$NAME
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
##  [1] "Macon"      "Sampson"    "Cherokee"   "Cumberland" "Union"     
##  [6] "Anson"      "Hoke"       "Duplin"     "Richmond"   "Clay"      
## [11] "Scotland"
(diff = setdiff(names_with_s2, nc[line,]$NAME))
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## [1] "Jackson"     "Mecklenburg"
plot(st_geometry(nc))
plot(st_geometry(nc)[nc$NAME %in% diff], col = "#00880088", add = TRUE)

ex. 7.3

Plot the two different lines in a single plot; note that R always plots a straight line as straight in the projection currently used; st_segmentize can be used to add points on a straight line, or on a great circle for ellipsoidal coordinates.

plot(st_geometry(nc))
plot(st_geometry(nc)[nc$NAME %in% diff], col = "#00880088", add = TRUE)
plot(line, add = TRUE)
plot(st_segmentize(line, units::set_units(10, km)), add = TRUE, col = 'red')

ex. 7.4

NDVI, the normalized difference vegetation index, is computed as (NIR-R)/(NIR+R), with NIR the near infrared and R the red band. Read the L7_ETMs.tif file into object x, and distribute the band dimension over attributes by split(x, "band"). Then, add attribute NDVI to this object by using an expression that uses the NIR (band 4) and R (band 3) attributes directly.

library(stars)
(x = read_stars(system.file("tif/L7_ETMs.tif", package = "stars")))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##              Min. 1st Qu. Median     Mean 3rd Qu. Max.
## L7_ETMs.tif     1      54     69 68.91242      86  255
## dimension(s):
##      from  to  offset delta                       refsys point values x/y
## x       1 349  288776  28.5 UTM Zone 25, Southern Hem... FALSE   NULL [x]
## y       1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL
(x.spl = split(x))
## stars object with 2 dimensions and 6 attributes
## attribute(s):
##     Min. 1st Qu. Median     Mean 3rd Qu. Max.
## X1    47      67     78 79.14772      89  255
## X2    32      55     66 67.57465      79  255
## X3    21      49     63 64.35886      77  255
## X4     9      52     63 59.23541      75  255
## X5     1      63     89 83.18266     112  255
## X6     1      32     60 59.97521      88  255
## dimension(s):
##   from  to  offset delta                       refsys point values x/y
## x    1 349  288776  28.5 UTM Zone 25, Southern Hem... FALSE   NULL [x]
## y    1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE   NULL [y]
x.spl$NDVI = (x.spl$X4 - x.spl$X3)/(x.spl$X4 + x.spl$X3)
plot(x.spl["NDVI"])

ex. 7.5

Compute NDVI for the L7_ETMs.tif image by reducing the band dimension, using st_apply and a function ndvi = function(x) { (x[4]-x[3])/(x[4]+x[3]) }. Plot the result, and write the result to a GeoTIFF.

ndvi_fn = function(x) { (x[4]-x[3])/(x[4]+x[3]) }
ndvi = st_apply(x, 1:2, ndvi_fn)
plot(ndvi)

write_stars(ndvi, "ndvi.tif")

An alternative function is

ndvi_fn = function(x1,x2,x3,x4,x5,x6) { (x4-x3)/(x4+x3) }
ndvi2 = st_apply(x, 1:2, ndvi_fn)
all.equal(ndvi, ndvi2)
## [1] TRUE

This latter function can be much faster, as it is called for chunks of data rather than for individual pixels.
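The difference can be illustrated in base R on a plain array (a stand-in for the image; the values are arbitrary): the per-pixel function is called once for every cell, while the band-wise function works on whole bands in a single vectorised call.

```r
a = array(1:24, dim = c(2, 2, 6))   # toy image: 2 x 2 pixels, 6 bands
# per-pixel: apply() calls the function once per (x, y) cell
ndvi_px = apply(a, 1:2, function(x) (x[4] - x[3]) / (x[4] + x[3]))
# band-wise: each band is a whole matrix, one vectorised expression
ndvi_vec = (a[, , 4] - a[, , 3]) / (a[, , 4] + a[, , 3])
all.equal(ndvi_px, ndvi_vec)
```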

ex. 7.6

Use st_transform to transform the stars object read from L7_ETMs.tif to EPSG:4326. Print the object. Is this a regular grid? Plot the first band using arguments axes=TRUE and explain why this takes such a long time.

(x_t = st_transform(x, 'EPSG:4326'))
plot(x_t[,,,1], axes = TRUE)

The printed object shows it is a curvilinear grid. Plotting takes so long because for curvilinear grids, each cell is converted to a small polygon and plotted individually.

ex. 7.7

Use st_warp to warp the L7_ETMs.tif object to EPSG:4326, and plot the resulting object with axes=TRUE. Why is the plot created much faster than after st_transform?

x_w = st_warp(x, crs = 'EPSG:4326')
plot(x_w[,,,1], reset = FALSE)
## downsample set to c(0,0,1)
plot(st_as_sfc(st_bbox(x_w)), col = NA, border = 'red', add = TRUE)

Plotting is faster now because the warped object is a new regular grid. Note that the grid border does not align perfectly with the rectangle formed by the bounding box (drawn with straight lines in an equidistant rectangular projection): the white grid cells along the border indicate the misalignment caused by the warping.

Chapter 8: large datasets

ex 8.1

For the S2 image (above), find out in which order the bands are using st_get_dimension_values(), and try to find out (e.g. by internet search) which spectral bands / colors they correspond to.

f = "sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip"
granule = system.file(file = f, package = "starsdata")
file.size(granule)
## [1] 769461997
base_name = strsplit(basename(granule), ".zip")[[1]]
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/", base_name, 
    ".SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
(p = read_stars(s2, proxy = TRUE))
## stars_proxy object with 1 attribute in 1 file(s):
## $`MTD_MSIL1C.xml:10m:EPSG_32632`
## [1] "[...]/MTD_MSIL1C.xml:10m:EPSG_32632"
## 
## dimension(s):
##      from    to offset delta                refsys point    values x/y
## x       1 10980  3e+05    10 WGS 84 / UTM zone 32N    NA      NULL [x]
## y       1 10980  6e+06   -10 WGS 84 / UTM zone 32N    NA      NULL [y]
## band    1     4     NA    NA                    NA    NA B4,...,B8
st_get_dimension_values(p, "band")
## [1] "B4" "B3" "B2" "B8"

ex 8.2

Compute NDVI for the S2 image, using st_apply and an appropriate ndvi function. Plot the result to screen, and then write the result to a GeoTIFF. Explain the difference in runtime between plotting and writing.

ndvi_fn = function(r, g, b, nir) (nir-r)/(nir+r)
ndvi = st_apply(p, 1:2, ndvi_fn)
plot(ndvi)
## downsample set to c(19)

Alternatively, one could use a per-pixel function (with band order B4, B3, B2, B8, red is x[1] and NIR is x[4]):

ndvi_fn = function(x) (x[4] - x[1])/(x[4] + x[1])

but that is much less efficient, as it is called once per pixel rather than once per chunk. Write to a tiff:

system.time(write_stars(ndvi, "ndvi.tif"))
## ================================================================================
##    user  system elapsed 
## 221.005   6.905  55.413

The runtime difference is caused by the fact that plot downsamples, and so computes only a very small fraction of the available pixels, whereas write_stars computes all pixels, and then writes them.
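To get a feel for the size of that fraction: assuming that downsample = 19 means every 20th pixel is read in both x and y of the 10980 x 10980 image, the plotted fraction is roughly:

```r
full = 10980^2               # pixels at full resolution (per band)
down = ceiling(10980 / 20)^2 # pixels read when every 20th is sampled
c(full = full, down = down, fraction = down / full)
```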

ex 8.3

Plot an RGB composite of the S2 image, using the rgb argument to plot(), and then by using st_rgb() first.

plot(p, rgb = 1:3)
## downsample set to c(19)

plot(st_rgb(p[,,,1:3], maxColorValue=13600))
## downsample set to c(19)

ex 8.4

Select five random points from the bounding box of the S2 image, and extract the band values at these points. What is the class of the object returned? Convert the object returned to an sf object.

pts = p %>% st_bbox() %>% st_as_sfc() %>% st_sample(5)
(p5 = st_extract(p, pts))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##                                Min. 1st Qu. Median    Mean 3rd Qu. Max.
## MTD_MSIL1C.xml:10m:EPSG_32632     0 1699.25 2209.5 2021.95  2837.5 3648
## dimension(s):
##          from to offset delta                refsys point
## geometry    1  5     NA    NA WGS 84 / UTM zone 32N  TRUE
## band        1  4     NA    NA                    NA    NA
##                                                         values
## geometry POINT (344398.1 5986860),...,POINT (318560.4 5980671)
## band                                                 B4,...,B8
class(p5)
## [1] "stars"
st_as_sf(p5)
## Simple feature collection with 5 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 318560.4 ymin: 5911044 xmax: 407394.7 ymax: 5996965
## Projected CRS: WGS 84 / UTM zone 32N
##     B4   B3   B2   B8                 geometry
## 1 3099 3165 3648 3088 POINT (344398.1 5986860)
## 2    0    0    0    0 POINT (407394.7 5911044)
## 3 2220 2199 2526 1973 POINT (370068.3 5947493)
## 4 1947 1806 2030 1379 POINT (404787.1 5996965)
## 5 2477 2731 3397 2754 POINT (318560.4 5980671)

ex 8.5

For the 10 km radius circle around POINT(390000 5940000), compute the mean pixel values of the S2 image when downsampling the images with factor 30, and on the original resolution. Compute the relative difference between the results.

b = st_buffer(st_sfc(st_point(c(390000, 5940000)), crs = st_crs(p)), 
    units::set_units(10, km))
plot(p[,,,1], reset = FALSE, axes = TRUE)
## downsample set to c(19)
plot(b, col = NA, border = 'green', add = TRUE)

p1 = st_as_stars(p, downsample = 30)
a1 = aggregate(p1, b, mean)

For the full resolution (this takes a while):

system.time(a2 <- aggregate(p, b, mean))
## Warning in c.stars(structure(list(MTD_MSIL1C.xml.10m.EPSG_32632 =
## structure(c(1032.86211707414, : along argument ignored; maybe you wanted to use
## st_redimension?
##    user  system elapsed 
##  77.600   1.037  70.074

Relative differences: we will work on the arrays of the stars objects:

(a1[[1]] - a2[[1]])/((a1[[1]]+a2[[1]])/2)
##             [,1]         [,2]        [,3]         [,4]
## [1,] 0.001055567 0.0009431265 0.001103254 7.781836e-05

Alternatively one could convert a1 and a2 to a data.frame, using as.data.frame, and work on the third column of the data frames.
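A minimal sketch of that alternative; it assumes the a1 and a2 objects computed above, for which as.data.frame() puts the aggregated values in the third column:

```r
# not self-contained: requires a1 and a2 from the previous chunks
df1 = as.data.frame(a1)
df2 = as.data.frame(a2)
(df1[[3]] - df2[[3]]) / ((df1[[3]] + df2[[3]]) / 2)
```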

ex 8.6

Use hist to compute the histogram of the downsampled S2 image. Also do this for each of the bands. Use ggplot2 to create a single plot with all four histograms in facets.

hist(p1)

hist(p1[,,,1])

hist(p1[,,,2])

hist(p1[,,,3])

hist(p1[,,,4])

library(ggplot2)
ggplot(as.data.frame(p1), aes(x = MTD_MSIL1C.xml.10m.EPSG_32632)) +
        geom_histogram() + facet_wrap(~band)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ex 8.7

Use st_crop to crop the S2 image to the area covered by the 10 km circle. Plot the results. Explore the effect of setting argument crop = FALSE.

plot(st_crop(p, b))
## downsample set to c(3)

plot(st_crop(p, b, crop = FALSE))
## downsample set to c(19)

ex 8.8

With the downsampled image, compute the logical layer where all four bands have pixel values higher than 1000. Use a raster algebra expression on the four bands (use split first), or use st_apply for this.

p_spl = split(p1)
p_spl$high = p_spl$B4 > 1000 & p_spl$B3 > 1000 & p_spl$B2 > 1000 & p_spl$B8 > 1000
plot(p_spl["high"])

Alternatively, using st_apply over the band dimension:

p2 = st_apply(p1, 1:2, function(x) all(x > 1000))
plot(p2)